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Graphical modelling has a long history in statistics as a tool for the analysis 
of multivariate data, starting from Wright's path analysis [70] and Gibbs' ap- 
plications to statistical physics 29 at the beginning of the last century. In its 
modern form, it was pioneered by Lauritzen and Wermuth [40j and Pearl |52j in 
the 1980s, and has since found applications in fields as diverse as bioinformatics 
[2"5] , customer satisfaction surveys [31] and weather forecasts pQ . 

Genetics and systems biology are unique among these fields in the dimension 
of the data sets they study, which often contain several hundreds of variables and 
only a few tens or hundreds of observations. This raises problems in both com- 
putational complexity and the statistical significance of the resulting networks, 
collectively known as the "curse of dimensionality" . Furthermore, the data them- 
selves are difficult to model correctly due to the limited understanding of the 
underlying mechanisms. In the following, we will illustrate how such challenges 
affect practical graphical modelling and some possible solutions. 

1 Background and Notation 

Graphical models [351 [55] are a class of statistical models for representing the 
probabilistic relationships among a large number of variables with graphical 
structures and performing probabilistic inference with those variables. Graph- 
ical models are composed by a set X = {X\, X2, ■ ■ ■ , X p } of random variables 
describing the quantities of interest and a graph Q = (V, E) in which each node 
or vertex v € V is associated with one of the random variables in X. The edges 
e 6 E are used to express direct dependence relationships among the variables in 
X. The set of these relationships is often referred to as the dependence structure 
of the graph. Different classes of graphs express these relationships with different 
semantics, which have in common the principle that graphical separation of two 
vertices implies the conditional independence of the corresponding random vari- 
ables [52] . Examples most commonly found in literature are Markov networks 
[TH1IB3], which use undirected graphs; chain graphs, which use partially directed 
graphs; and Bayesian networks [3SED]: which use directed acyclic graphs. 

In principle, there arc many possible choices for the joint distribution of X, 
depending on the nature of the data and the aims of the analysis. However, 
literature have focused mostly on two cases: the discrete case [311 169] . in which 
both X and the Xi are multinomial random variables, and the continuous case 
[28, 69 , in which X is multivariate normal and the Xi are univariate normal 
random variables. In the former, the parameters of interest are the conditional 



probabilities associated with each variable, usually represented as conditional 
probability tables; in the latter, the parameters of interest are the partial corre- 
lation coefficients between each variable and its neighbours in Q. 

The estimation of the structure of the graph Q is called structure learning 
|19[ 136) , and consists in finding the graph structure that encodes the conditional 
independencies present in the data. Ideally it should coincide with the depen- 
dence structure of X, or it should at least identify a distribution as close as 
possible to the correct one in the probability space. Several algorithms have 
been presented in literature for this problem, thanks to the application of re- 
sults from probability, information and optimisation theory. Despite differences 
in theoretical backgrounds and terminology, they can all be traced to three ap- 
proaches: constraint-based (which are based on conditional independence tests), 
score-based (which are based on goodness-of-fit scores) and hybrid (which com- 
bine the previous two approaches). For some examples, see Bromberg et al. 
[§], Castelo and Roverato [12] . Friedman et al. [37], Larrahaga et al. and 
Tsamardinos et al. 1681 . 




All these structure learning algorithms operate under a set of common as- 
sumptions: 



— there must be a one-to-one correspondence between the nodes in the graph 
and the random variables in X; this means in particular that there must not 
be multiple nodes which are deterministic functions of a single variable; 

— observations must be independent. If some form of temporal or spatial de- 
pendence is present, it must be specifically accounted for in the definition of 
the network, as in dynamic Bayesian networks [36] : 

— every combination of the possible values of the variables in X must represent 
a valid, observable (even if really unlikely) event. This assumption implies 
a strictly positive global distribution, which is needed to have a uniquely 
identifiable model. 

The structure of a graphical model has two important properties. First, it 
defines the decomposition the probability distribution of X, called the global 
distribution, into a set of local distributions. For practical reasons, each local 
distribution should involve only a small number of variables. This decomposition 
is known as the Markov property of graphical models, and is a key property in 
applying graphical modelling to high dimensional problems. Second, the Markov 
blanket of each node can be easily identified from the structure of the graph. 
For instance, in Bayesian networks the Markov blanket of a node Xi is the set 
consisting of the parents of Xi , the children of Xi and all the other nodes sharing 
a child with Xi [35] . Since the Markov blanket is defined as the set of nodes that 
makes the target node (i.e. Xi) independent from all the other nodes in X, it 
provides a theoretically-sound solution to the feature selection problem [37] . 




2 Data and Models in Statistical Genetics and Systems 
Biology 

In genetics and systems biology, graphical models are employed to describe and 
identify interdependencies among genes and gene products, with the eventual 
aim to better understand the molecular mechanisms that link them. Data com- 
monly made available for this task by current technologies fall into three groups: 

1. gene expression data (251 162j . which measure the intensity of the activity 
of a particular gene through the presence of messenger RNA (mRNA, for 
protein-coding genes) or other kinds of non-coding RNA (ncRNA, for non- 
coding genes); 

2. protein signalling data [5B], which measure the proteins produced as a result 
of each gene's activity; 

3. sequence data [17], which provide the nucleotide sequence of each gene. For 
both biological and computational reasons, such data contain mostly single- 
nucleotide polymorphisms (SNPs) - genes which vary in only one nucleotide 
between individuals - having only two possible alleles, called biallelic SNPs. 

In the case of gene expression and protein signalling data, we are interested 
in reconstructing the functional pathways involved in some molecular process. 
Bayesian networks are naturally suited to this task. If we assign each gene to one 
node in the network, edges represent the interplay between different genes. They 
can describe cither direct interactions or indirect influences that are mediated 
by unobserved genes. This is a crucial property because it is impossible in prac- 
tice to completely observe a complex molecular process: either we do not know 
all the genes involved or we may be unable to obtain reliable measurements of 
all their expression levels. Therefore, the structure of a Bayesian network pro- 
vides an intuitive representation of the pathways involved in the process that 
can accommodate the noisiness inherent to biological data. Furthermore, under 
appropriate conditions [36 , 53 edge directions may be indicative of the causal 
relationships in the underlying pathways. In that case, the Bayesian network re- 
flects the ordering of connections between pathway components and the actual 
flow of the molecular process. 

Similar considerations can be made when protein signalling data are used 
just to identify protein-protein interactions, limiting ourselves to the study of 
the cell's physiology. 

On the other hand, in sequence data analysis we are interested in modelling 
the behaviour of one or more phenotypic traits (e.g. the presence of a disease 
in humans, yield in plants, milk production in cows) by capturing direct and 
indirect causal genetic effects. Unless some prior knowledge on the genetic ar- 
chitecture of a trait is available, a large set of genes spread over the whole genome 
is required for such effects to be detectable. If the focus is on identifying the genes 
that are strongly associated with a trait, the analysis is called a genome-wide 
association study (GWAS). In plants and animal genetics, we may also be inter- 
ested in predicting a trait accurately for the purpose of implementing a selection 



program (i.e. to decide which plants or animals to cross so that the offspring 
exhibit). In this case, we speak about genomic selection (GS) models. 

Applications of Bayesian networks to sequence data are more problematic 
than in the previous cases; some care must be taken in their interpretation as 
causal models. Edges linking genes to a trait can be considered direct associa- 
tions. As was the case for gene expression data, under appropriate conditions 
such associations may actually be indicative of real causal effects. On the other 
hand, edges linking genes to other genes arise from the genetic structure of the 
individuals in the sample. It is expected, for example, that genes that are lo- 
cated near each other on a chromosome are more likely to be inherited together 
during meiosis, and are therefore said to be genetically linked [20 . Furthermore, 
even genes that are far apart in the genome can be in linkage disequilibrium 
(LD) if some of their configurations occur more often or less often than would 
be expected from their marginal frequencies. Both these phenomena induce as- 
sociations between the genes, but not cause-effects relationships. From a strictly 
causal point of view, a chain graph in which genes are linked by undirected edges 
and the only directed edges are the ones incident on the traits provides a better 
visual representation of the network structure. 

2.1 Gene Expression Data 

Gene expression data are typically composed of a set of intensities measuring 
the abundance of several RNA patterns, each meant to probe a particular gene. 
These intensities are measured either radioactively or fluorescently, using labels 
that mark the desired RNA patterns 18, 43, 45 . 

The measured abundances present several limitations. First of all, microar- 
rays measure abundances only in terms of relative probe intensities, not on an 
absolute scale. As a result, comparing different studies or including them in a 
meta-analysis is difficult in practice without the use of rank-based methods [H] . 
Furthermore, even within a single study abundance measurements are system- 
atically biased by batch effects introduced by the instruments and the chemical 
reactions used in collecting the data [59j . 

By their nature, gene expression data are modelled as continuous random 
variables and are investigated using Pearson's correlation, either assuming a 
Gaussian distribution or applying results from robust statistics [32j [66]. The 
simplest graphical models used for gene expression data are relevance networks 
[10] . also known in statistics as correlation graphs. Relevance networks are con- 
structed by estimating the correlation matrix of the genes and thresholding its 
elements, so that weak correlations are set to zero. Finally, a graph is drawn in 
order to depict the remaining strong correlations. 

Covariance selection models |17j . also known as concentration graphs or 
graphical Gaussian models |69j , consider conditional rather than marginal depen- 
dencies; the presence of an edge is determined by the value of the corresponding 
partial correlation and not of the marginal one. In the context of systems bi- 
ology, the resulting graphs are often called gene association networks, and are 
not trivial to estimate from high-dimensional genomic data. Several solutions 
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Fig. 2.1. A Bayesian network learned from gene expression data and used as an exam- 
ple in Friedman [22] . Grey nodes correspond to the regulators of the network, the genes 
controlling the expression of the other (target) genes involved in a molecular process. 



are proposed in literature, based either on James-Stein regularisation [57J HE] or 
on different penalised maximum likelihood approaches [5J [STJ 33] . 

Both gene relevance and gene association networks are undirected graphs. 
The application of Bayesian networks to learn large-scale directed graphs from 
microarray data was pioneered by Friedman et al. |26j . and has also been re- 



viewed more recently in Friedman [22] (see Figure 2.1). The high dimensionality 
of the model means that inference procedures are usually unable to identify a 
single best Bayesian network, settling instead on a set of equally well behaved 
models. For this reason, it is important to incorporate prior biological knowledge 
into the network through the use of informative priors [48] . 



2.2 Protein Signalling Data 

Protein signalling data are similar to gene expression data in many respects, and 
are in fact often used to investigate indirectly the expression of a set of genes. In 
general, the relationships between proteins are indicative of their physical loca- 
tion within the cell and of the development over time of the molecular processes 
they are involved in. Understanding these associations may improve our under- 
standing of cellular physiology and may lead to the identification of unknown 
proteins and their functions. 

can 



From a modelling perspective, all the approaches covered in Section 2.1 



be applied to protein signalling data with little to no change. However, it is 




Fig. 2.2. The Bayesian network learned from the protein-signalling data in Sachs et 
al. [56] using model averaging and data from several experiments performed under 
different stimulatory and inhibitory cues. 



important to note that protein signalling data sometimes have sample sizes that 
are much larger than either gene expression or sequence data; an example is 
the study from Sachs et al. on how to derive a causal Bayesian network from 
multi-parameter single-cell data (Figure 2.2). 



2.3 Sequence Data 

Sequence data are fundamentally different from both gene expression and protein 
signalling data, for several reasons. First, sequence data provide direct access 
to the genome's information, without relying on indirect measurements. As a 
result, they provide a closer view of the genetic layout of an organism than 
other approaches. Second, sequence information is intrinsic to each individual, 
and does not vary over time; therefore, the inability of static Bayesian networks 
to model feedback loops is not a limitation in this case. 

Furthermore, sequence data is naturally defined on a discrete rather than 
continuous domain. Each gene has a finite number of possible states, determined 
by the number of combinations of nucleotides differing between the individuals 
in the sample. In the case of biallelic SNPs, each SNP X t differs at a single 
base-pair location and has only three possible variants. They are determined by 
the (unordered) combinations of the two nucleotides observed at that location, 
called the alleles, and are often denoted as "AA" , "Aa" , "aa" . The "A" and "a" 
labels can be assigned to the nucleotides in several ways; for instance, "A" can be 



chosen as either the one most common in the sample (which makes models easier 
to interpret) or by following the alphabetical order of the nucleotides (which 
makes the labelling independent from the sample). "A A" and "aa" individuals 
are said to be homozygotes, because both nucleotides in the pair have the same 
allele; "Aa" individuals are said to be hetero zygotes. 

From a graphical modelling perspective, modelling each SNP as a discrete 
variable is the most convenient option; multinomial models have received much 
more attention in literature than Gaussian or mixed ones. On the other hand, 
the standard approach in genetics is to recode the alleles as numeric variables, 
e.g. 

1 if the SNP is "AA" ( 2 if the SNP is "AA" 

Xi= { if the SNP is "Aa" or X t = \ 1 if the SNP is "Aa" . (2.1) 
-1 if the SNP is "aa" [ if the SNP is "aa" 

In the former, "Aa" taken as a reference to measure the effects of the two ho- 
mozygotes. In the latter, encodes the number of copies of the "A" allele in 
the individual. In both cases, the recoded variables are typically modelled using 
an additive Bayesian linear regression model of the form 

n 

y = H + ^2Xigi+£, fli~T ffi , e~iV(0,E) (2.2) 

i=l 

where gi denotes the effect of gene Xi, y is the trait under study and fi is the 
population mean. The matrix X models the rclatedness of the subjects, which 
is called kinship in genetics, and populations structure g]. In human genetics, 
it is often assumed to be the identity matrix, which implies the assumptions 
that individuals are unrelated. Several implementations of Equation |2.2| based 
on linear mixed models and penalised regressions have been proposed, mostly 
within the framework of Bayesian statistics. Some examples are the Genomic 
BLUP (GBLUP), BayesA and BayesB from Meuwissen et al. @B], the Bayesian 
LASSO from Park and Casella [5T] and the BayesC7r from Habier et al. [IB] . 

Graphical models, and Bayesian networks in particular, provide a systematic 
way to categorise and extend such models. Consider the four different models 



shown in Figure 2.3 The classic additive model from Equation 2.2 is shown in 
the top-left panel; SNPs are independent from each other and all contribute in 
explaining the behaviour of the phenotypic trait. This is the case for BayesA and 
GBLUP. In the top-right panel, some SNPs are identified as non-significant and 
excluded from the additive model. Models of this kind include BayesB, BayesC7r 
and the Bayesian LASSO, which perform feature selection in the context of 
model estimation. 

A natural way to extend these models is to include interactions between the 
SNPs, as shown in the two bottom panels. A recent study by Morota et al. [UJ 
has shown that assuming additive effects can only be justified on the grounds of 
computational efficiency, because interactions between the SNPs are so complex 
that even pairwise dependence measures are not able to capture them completely. 
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Fig. 2.3. Different approaches to GS and GWAS. On the top, classic additive Bayesian 
linear regression models with and without feature selection. On the bottom, more 
complex models based on Bayesian networks. 



On the other hand, Bayesian networks provide a more accurate picture of these 
dependencies and are more effective at capturing and displaying them, ff the 
trait is discrete, Bayesian network classifiers [53] such as the Tree-Augmented 
Naive Bayes (TAN) can also be used to implement GS and GWAS models. 

3 Challenges in Bayesian Network Modelling 

Gene expression, protein signalling and sequence data are difficult to analyse 
in a rigorous and effective way regardless of the model used, as they present 
significant computational and statistical challenges. The combination of small 
sample sizes and large numbers of variables (n < p), often called the "curse of 
dimensionality" , requires careful handling in model specification and algorithm 
implementation. This is especially true for Bayesian networks, because both 
learning and inference are NP-hard [141 II 5] and scale quadratically in the number 
of variables. 

3.1 Limits of the "Small n, Large p" Data Sets 



The disparity between the available sample sizes and the number of genes or pro- 
teins under investigation is probably the most important limiting factor in genet- 



ics and systems biology. In a few cases, the underlying phenomenon is known to 
the extent that only the relevant variables are included in the model (Sachs et al. 
[56] is one such study). However, in general molecular processes are so complex 
that statistical modelling is used more as a tool for exploratory analysis than to 
provide mechanistic explanations. In the former case, we have that n>p, and 
we can use results from large-sample theory [41] and computationally-intensive 
techniques [6JQT] in selecting and estimating our models. In the latter, the limits 
of the model depend heavily on the available knowledge on the phenomenon and 
on our ability to incorporate it in the prior. 

Consider, following Bayes' theorem, the posterior distribution of the param- 
eters in the model (say 0) given the data 

p(0 | X) cx p(X | 9) ■ p(0) = L(0; X) • p(0) (3.1) 

or, equivalently, 

logp(0 | X) =c + logL(0;X)+logp(0). (3.2) 

The log-likclihood, logL(0;X), is a function of the data and therefore scales 
with the sample size, while the prior density does not. Therefore, as the sample 
size increases, the information present in the data dominates the information 
provided in the prior and determines the overall behaviour of the model. For 
small sample sizes, the prior distribution plays a much larger role because there 
is not enough data available to disprove the assumptions the prior encodes. As 
a result, conclusions arising from model estimation and inference reflect our 
beliefs on the phenomenon (as encoded in the prior) more than the reality of the 
observed molecular processes. In this context, even the use of non-informative 
priors (which are designed to be as little informative as possible) may result in 
posteriors with undesirable properties [TJ. 

In that regard, Bayesian networks present considerable advantages. First, 
they are more flexible than most of the classic models used in genetics and 
systems biology in specifying variable selection rates and interactions. In other 
words, the prior makes fewer assumptions on the probabilistic structure of the 
data and is therefore less likely to completely dominate the likelihood. Second, 
the effects of the values assigned to the parameters of a non-informative prior 
are well understood for both small and large samples (53] E3], and corrected 
posterior density functions are available in closed form. Third, 

Another important consideration is the ease of estimating the model. Mod- 
els used in genetics and systems biology often require expensive Markov Chain 
Monte Carlo simulations; two such examples are BayesA and BayesB. On the 
other hand, many closed form results are available for both discrete and Gaussian 
Bayesian networks. For networks up to 100 variables, exact structure learning 
algorithms are available [35] and exact inference algorithms such as Variable 
Elimination and Clique Trees [36] are feasible to use. For larger networks, effi- 
cient structure learning heuristics such as the Semi-Interleaved Hiton-PC from 
Aliferis et al. [5] |3] and approximate inference algorithms such as the Adaptive 
Importance Sampling for Bayesian Networks (AIS-BN) from Cheng and Druzdel 
[T3] are feasible up to several thousands of variables. 



3.2 Discrete or Continuous Variables? 



All the data types covered in Section [2] are often modelled using Gaussian 
Bayesian networks, which represent the natural evolution of the linear regression 
models used in literature. In the case of gene expression and protein signalling 
data, sometimes [301 ES] the data are discretised into intervals and a discrete 
Bayesian network is used instead. As for gene expression data, both Gaussian 
and discrete Bayesian networks can be used depending on whether we disregard 
the order of the alleles or not. 

Clearly, both distributional assumptions present important limitations. Gaus- 
sian Bayesian networks assume that the global distribution is multivariate nor- 
mal. This is unreasonable in the case of sequence data, which can only assume 
a finite, discrete set of values. Gene expression and protein signalling data, 
while continuous, are in general significantly skewed unless preprocessed with 
a Box-Cox transformation [71] , Furthermore, Gaussian Bayesian networks are 
only able to capture linear dependencies, and have a low power in detecting 
non-linear ones. On the other hand, using discrete Bayesian networks and as- 
suming a multinomial distribution disregards useful information present in the 
data. If the ordering of the intervals (for discretised gene expression and protein 
signalling data) or of the alleles (in sequence data) is ignored, both learning and 
subsequent inference are not aware that dependencies are likely to take the form 
of stochastic trends. This is true, in particular, for sequence data, as the effect 
of the heterozygous allele is necessarily comprised between the effect of the two 
heterozygous alleles. 

An approach that has the potential to outperform both discrete and Gaussian 
assumptions has been recently proposed by Musella [49] with Bayesian networks 
learned from ordinal data. Structure learning is performed using a constraint- 
based approach (in particular, the PC algorithm from Sprites et al. |61j ) using the 
Jonckheere-Terpstra test for trend among ordered alternatives [33J ISS] . Consider 
a conditional independence test for X\ _1L A3 | A2, where X%, A2 and A3 have 
T, L and C levels respectively. The test statistic is defined as 



L T i — 1 (J , , 

E_ n l+k (n i+k + 1) 



JT =EEE 

k=l i=2 j = l 



_s=l 

where the Wij S k are Wilcoxon scores, defined as 

s-l 



(3.3) 
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t=l 



, rii S k + n jsk + 1 
nuk + rijtk H 



(3.4) 



and has an asymptotic normal distribution with mean and variance defined in 
Lehmann [32] and Pirie [55] . The null hypothesis is that of homogeneity; if we 
denote with Fi } k{x 3 ) the distribution function of A3 | Ai = i, A2 = k, 

H ■ F lik {x 3 ) = F 2 , k (x 3 ) = ... = F T . k {x 3 ) for Vx 3 and Vfc. 
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Fig. 3.1. Three patterns of SNP effects on a phenotypic trait: linear association (left), 
a dominant SNP (centre), a recessive SNP (right). 



The alternative hypothesis Hi = ffi.iUffi^ is that of stochastic ordering, either 
increasing 

'■ Fi,k{ x a) ^ Fj,k{ x 3) with i < j for Vi£3 and Vfc 

or decreasing 

Hi,2 ■ Fi,k(x3) ^ Fj,k(%3) with i < j for VX3 and Vfc. 

The advantages of the Jonckheere-Terpstra test compared to linear associ- 
ation can be illustrated, for example, by considering the different patterns of 
dominance of a single SNP shown in Figure |3.1| Due to the way SNPs are 
recoded as numeric variables, assuming that dependence relationships are lin- 
ear (left panel) forces the effect of heterozygotes to be the mean of the effects 
of the the respective homozygotcs. This is not always the SNPs can 

be dominant (centre) or recessive (right) for a trait, either singly or in groups 
[20] . Tests for linear association have very low power against such nonlinear 
alternative hypotheses. On the other hand, the alternative hypothesis of the 
Jonckheere-Terpstra test characterises correctly both dominant and recessive 
SNPs. Furthermore, the Jonckheere-Terpstra test exhibits more power than the 
independence tests used in discrete Bayesian networks because of the more spe- 
cific alternative hypothesis (e.g. stochastic ordering is just one particular case of 
stochastic dependence). 

3.3 Feature Selection a Data Pre-Processing Step 

It is not possible, nor expected, for all genes in modern, genome-wide data sets 
to be relevant for the trait or the molecular process under study. In part, this is 
because of the curse of dimensionality, but it is also because some genes provide 
essentially the same information due to linkage disequilibrium. Furthermore, the 
effects of some genes on a trait may be mediated by other genes, thus making 



them redundant. For this reason, in practice statistical models in systems biol- 
ogy and genetics require a feature selection to be performed, either during the 
learning process or as a separate data pre-processing step. 

In the context of GS and GWAS models, we aim to find the subset of genes 
S C X such that 



P(y|X)=P(y|S,X\S)«P(y|S), (3.5) 

that is, the subset of genes (S) that makes all other markers (X \ S) redundant 
as far as the trait y we are studying is concerned. Markov blankets identify 
such a subset in the framework of graphical models; several algorithms have 
been proposed in literature for their learning [2J [57] . After the set S has been 
identified, we can either fit one of the Bayesian linear regression models from 



Section 2.3 or learn a Bayesian network from y and S. In both cases, the smaller 
number of variables included in the model reduces the effects of the curse of 
dimensionality (by reducing p and making it closer to n). On the other hand, 
the conditional independence tests used by Markov blanket learning algorithms 
do not take kinship into account; for example, all asymptotic tests assume that 
observations are independent. Such an assumption is likely to be violated in 
animal and plant genetics, which make heavy use of inbred populations {e.g. 
in which offspring is produced from parents selected from the population itself 
and who are closely related) . Therefore, interpreting edges from S to y as direct 
causal influences may lead to spurious results, even when the model shows good 
predictive power [2J. 

As far as gene expression and protein signalling data are concerned, the prob- 
lem of feature selection is more complicated. In many cases, we are interested 
in a complex molecular process, as opposed to a single trait. If we don't know a 
priori at least some of the genes involved in the molecular process, performing 
feature selection as a data pre-processing step is impossible; we have to identify 
the pathways we are interested in from the structure of the Bayesian network 
learned from X. At most we can enforce sparsity in the network by using shrink- 
age tests [5U] or non- uniform structural priors [2"3] . 

Even if we know which genes are involved, using Markov blankets for feature 
selection presents significant drawbacks. The Markov blanket of each gene must 
be learned separately because all algorithms in literature accept only one target 
node. If no information is shared between different runs of the learning algo- 
rithm, this task is embarrassingly parallel but still computationally intensive. If, 
on the other hand, we use backtracking and other optimisations to share infor- 
mation between different runs, significant speed-ups are possible at the cost of 
an increased error rate (i.e. false positives and false negatives among the nodes 
included in each Markov blanket). In both cases, merging the Markov blankets 
of each gene into a single set requires the use of symmetry corrections [2, 68 
that violate the proofs of correctness of the learning algorithms. 

A better approach in this context is the feature selection algorithm proposed 
by Pena et al. [53] . First, the algorithm is designed to identify all the all the nodes 
in a graphical model which are relevant to compute the conditional probability 



distribution for a given set of variables. Therefore, a single run is able to perform 
feature selection for all the nodes in the pathway. Second, the algorithm only 
requires pairwise measures of dependence, and is therefore computationally and 
sample efficient. This is in contrast with Markov blanket learning algorithms, 
which require independence tests involving large sets of conditioning variables. 

4 Conclusions 

Data sets in genetics and systems biology often contain several hundreds of vari- 
ables and only a few tens or hundreds of observations. This raises problems in 
both computational complexity and the statistical significance of the resulting 
networks, which are collectively known as the "curse of dimensionality". Fur- 
thermore, the data themselves are difficult to model correctly due to the limited 
understanding of the underlying mechanisms. Bayesian networks provide a very 
flexible framework to model such data, extending, complementing or replac- 
ing classic models present in literature. Their flexibility in incorporating prior 
knowledge, different parametric assumptions and different dependence struc- 
tures makes them a suitable choice for the analysis of gene expression, protein 
signalling and sequence data. 
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